YouTube Community Contributions
- Publication date
- 2020-10-28
- Topics
- youtube, metadata, community contributions, closed captions, subtitle, subtitles, youtube videos, title, description, translation, credits, accessibility
- Item Size
- 3.9G
YouTube Community Contributions allowed users to create and translate closed captions/subtitles, titles, and descriptions of YouTube videos uploaded by channels that had enabled the feature. Users could optionally choose to be credited for their captioning contributions.
This archive contains YouTube Community Contributions data, including draft, published, and uploader-provided content, from 406,394 videos, found by scanning over 50 million videos. It includes caption data in SBV format (intended to match the format used when downloading captions from the editor), metadata (titles and descriptions) in JSON format, and caption credits in JSON format. For each video, captions and metadata have one file per language with contribution data, while caption credits have data for all languages in the same file.
You can search the archive using a YouTube link or video ID. Some additional statistics and analysis, as well as PNG and HTML snapshots of the community contributions user interface, are also available.
Note for users browsing the compressed files online: Due to a bug in the Internet Archive, attempting to download an individual file for a video whose ID begins with a dash (-) may return a blank file. As a workaround, these files can be browsed and downloaded from the compressed archives in the “~-” folder, a copy of the “-” folder in which a tilde (~) is prepended to every name, which avoids the blank file bug.
Additional technical details useful for those working with the full dataset can be found below:
GENERAL ORGANIZATION OF THE DATA
The root directory of this item contains 64 folders to match the first character of the IDs of the video data it contains. Within each folder there are 64 ZIP archives, each named to match the first two characters of the IDs of the video data it contains. Each ZIP archive is approximately 1MB in size and contains data for approximately 100 videos. Within each ZIP archive there are folders named for the IDs of the video data it contains. These folders contain the following types of files, named according to the following rules (brackets are used to indicate variables which must be replaced with their appropriate values):
Metadata - Community Draft: [video ID]_[language code]_community_draft[_alternate version].json
Captions - Community Draft: [video ID]_[language code]_community_draft[_alternate version].sbv
Metadata - Community Published: [video ID]_[language code]_community_published.json
Captions - Community Published: [video ID]_[language code]_community_published[_alternate version].sbv
Metadata - Uploader Provided: [video ID]_[language code]_uploader_provided[_alternate version].json
Captions - Uploader Provided: [video ID]_[language code]_uploader_provided[_alternate version].sbv
Caption Credits - Community Published: [video ID]_published_credits.json
Video IDs must represent a valid YouTube video ID. In the case of this collection, all retrieved video IDs are 11 characters in length. Video IDs are case sensitive, and only the following characters are allowed: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
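The character and length rules above can be checked with a short regular expression. This is a minimal sketch: the function name and the sample IDs are hypothetical, and the 11-character length reflects what was observed in this collection rather than a general guarantee.

```python
import re

# Exactly 11 characters drawn from A-Z, a-z, 0-9, dash, and underscore,
# matching the rules stated for this collection.
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def is_valid_video_id(video_id: str) -> bool:
    """Return True if video_id has the ID shape used in this collection."""
    return VIDEO_ID_PATTERN.fullmatch(video_id) is not None


# Hypothetical IDs used only for illustration.
print(is_valid_video_id("AbCdEfGhIjK"))   # True
print(is_valid_video_id("-abc_DEF123"))   # True
print(is_valid_video_id("too-short"))     # False
print(is_valid_video_id("has space 12"))  # False
```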
Language codes are the standard short codes for a language or a regional variation of it. 196 language codes are allowed and included in the dataset. The language codes, as well as their English language names, are listed in the appendix.
Alternate versions are optional and only appear on a relatively small number of files. If more than one non-identical version of a file was retrieved, the first version is named as normal with no number appended to its filename. All subsequent versions are named as normal, but an underscore followed by the alternate version number is inserted into the filename before the file extension. Alternate version numbers start with 1 and increase as needed to accommodate all non-identical alternate versions of a file.
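The naming rules above can be sketched as a few small helpers. The short labels in FILE_KINDS and the function names are hypothetical, chosen only for this example. Because the description does not state the archive file extension, archive_location returns only the two-character archive name without one.

```python
# Filename parts for each data type, following the naming rules above.
# The dictionary keys are hypothetical labels used only in this sketch.
FILE_KINDS = {
    "metadata_draft": ("community_draft", "json"),
    "captions_draft": ("community_draft", "sbv"),
    "metadata_published": ("community_published", "json"),
    "captions_published": ("community_published", "sbv"),
    "metadata_uploader": ("uploader_provided", "json"),
    "captions_uploader": ("uploader_provided", "sbv"),
}


def archive_location(video_id):
    """Return (root folder name, archive name without extension)."""
    return video_id[0], video_id[:2]


def data_filename(video_id, language, kind, alternate=0):
    """Build a caption or metadata filename; alternate=0 is the first version."""
    status, extension = FILE_KINDS[kind]
    suffix = f"_{alternate}" if alternate else ""
    return f"{video_id}_{language}_{status}{suffix}.{extension}"


def credits_filename(video_id):
    """Build the caption credits filename (one file covers all languages)."""
    return f"{video_id}_published_credits.json"


# "AbCdEfGhIjK" is a made-up ID for illustration.
print(archive_location("AbCdEfGhIjK"))
# ('A', 'Ab')
print(data_filename("AbCdEfGhIjK", "en", "captions_draft"))
# AbCdEfGhIjK_en_community_draft.sbv
print(data_filename("AbCdEfGhIjK", "pt-BR", "metadata_uploader", alternate=2))
# AbCdEfGhIjK_pt-BR_uploader_provided_2.json
print(credits_filename("AbCdEfGhIjK"))
# AbCdEfGhIjK_published_credits.json
```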
SIZE OF THE DATASET
This dataset is approximately 3.83GB compressed. Decompressed, the dataset is 9.46GB, but since the dataset contains a large number of small files, it may have a larger size on disk such as 11.7GB. Combined, the 4096 ZIP archives contain 406,394 folders and 1,361,998 files.
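The figures above are consistent with the layout described earlier, as a quick calculation shows. This sketch only restates numbers from this description and assumes 1GB = 1024MB for the per-archive estimate.

```python
ROOT_FOLDERS = 64
ARCHIVES_PER_FOLDER = 64
VIDEO_FOLDERS = 406_394
FILES = 1_361_998
COMPRESSED_GB = 3.83

archives = ROOT_FOLDERS * ARCHIVES_PER_FOLDER
videos_per_archive = VIDEO_FOLDERS / archives
files_per_video = FILES / VIDEO_FOLDERS
# Assumes 1GB = 1024MB.
mb_per_archive = COMPRESSED_GB * 1024 / archives

print(archives)                      # 4096
print(round(videos_per_archive, 1))  # 99.2
print(round(files_per_video, 2))     # 3.35
print(round(mb_per_archive, 1))      # 1.0
```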
ADDITIONAL CONSIDERATIONS WHEN WORKING WITH THIS DATASET
When working with more than a few files from this dataset it is recommended to use a case-sensitive folder or filesystem. Linux filesystems are typically case-sensitive, but macOS and Windows filesystems are typically case-insensitive. On Windows 10 with the October 2018 update or later, you can enable case-sensitivity on a per-directory basis by opening a command line window, navigating to the folder you wish to make case-sensitive, and running “fsutil file setCaseSensitiveInfo . enable”. On macOS, you would need to create a new volume with a case-sensitive filesystem using Disk Utility.
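Before extracting a large part of the dataset, it may help to confirm that the target folder really is case-sensitive. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the check works by creating a temporary probe file and testing whether its upper-cased name also resolves.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Return True if names differing only by case are distinct in directory."""
    # The probe name contains lowercase letters, so its upper-cased form differs.
    with tempfile.NamedTemporaryFile(dir=directory, prefix="case_probe_") as probe:
        name = os.path.basename(probe.name)
        swapped = os.path.join(directory, name.upper())
        # On a case-insensitive filesystem the upper-cased name resolves
        # to the same probe file, so it appears to exist.
        return not os.path.exists(swapped)


# The result depends on the filesystem that holds the checked folder.
print(is_case_sensitive(tempfile.gettempdir()))
```

The probe file is created directly in the folder being checked, rather than in a subfolder, so that a per-directory setting such as the Windows one described above is what gets tested.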
Due to the large number of files and folders contained in the full dataset, it may also be advisable to disable search indexing for the folder in which the dataset is extracted.
ADDITIONAL FILES IN THIS ITEM
This item also contains some additional files aside from the ones described above. In the root directory, search.html and data_index.json are files used by the searchable index mentioned before. The ~- subdirectory in the root folder contains an additional copy of all of the files relating to videos with IDs which start with a dash symbol (-). This directory follows the same rules as previously described, but all files, folders, and archives have a ~ symbol prepended to their names. This directory was created because a bug in the Internet Archive infrastructure does not allow directly linking to files in ZIP archives where the filename starts with a dash symbol (-). Additionally, the root directory contains a torrent for the data: youtube-community-contributions_archive.torrent, as well as some system files: youtube-community-contributions_files.xml, youtube-community-contributions_meta.sqlite, and youtube-community-contributions_meta.xml, which are generated by the Internet Archive.
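The mapping from an original path to its copy in the ~- directory can be sketched as follows. This assumes, as the description above states, that every folder, archive, and file name receives the ~ prefix. The function name and the sample ID are hypothetical, and archive names are shown without an extension because none is stated here.

```python
def tilde_mirror(parts):
    """Prefix every path component with ~ when the path belongs to a dash ID.

    parts is a sequence of names: root folder, archive, video folder, file.
    Paths for IDs that do not start with a dash are returned unchanged.
    """
    if not parts or not parts[0].startswith("-"):
        return list(parts)
    return ["~" + part for part in parts]


# "-AbCdEfGhIj" is a made-up 11-character ID beginning with a dash.
original = ["-", "-A", "-AbCdEfGhIj", "-AbCdEfGhIj_en_community_draft.sbv"]
print(tilde_mirror(original))
# ['~-', '~-A', '~-AbCdEfGhIj', '~-AbCdEfGhIj_en_community_draft.sbv']
```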
WORKING WITH THE DATA INDEX
A JSON-formatted index of the data, as used by the searchable index, is available in data_index.json. This index can be used to determine the availability of a particular piece of data. It can also be used with the previously-described naming rules to generate folder paths and filenames for all of the files in this data.
The index file contains a JSON dictionary. The root dictionary has a key for each video ID that has data in this archive. The value of each of these keys is another dictionary. The possible keys of this dictionary and their meanings are as follows:
m: Metadata - Community Draft
c: Captions - Community Draft
n: Metadata - Community Published
d: Captions - Community Published
o: Metadata - Uploader Provided
e: Captions - Uploader Provided
f: Caption Credits - Community Published
None of the keys are guaranteed to exist for a given video ID; the keys only exist if data of that type is available for the given video ID in this archive. The value for all of these keys (except for key f) is a list of language codes. If a language code appears more than once in the list, the second occurrence and all additional occurrences are considered alternate versions of that file, and should have an underscore followed by the alternate version number inserted into the filename before the file extension. The first version of a file never receives an underscore. The lowest alternate version number for a file is 1. The only value for key f is 1, which indicates the availability of caption credits.
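The rules above for turning an index entry into filenames can be sketched as follows. The function name and the sample entry are hypothetical. A real run would load data_index.json with json.load instead of parsing an inline string.

```python
import json
from collections import Counter

# Index key -> (status part of the filename, file extension).
KEY_RULES = {
    "m": ("community_draft", "json"),
    "c": ("community_draft", "sbv"),
    "n": ("community_published", "json"),
    "d": ("community_published", "sbv"),
    "o": ("uploader_provided", "json"),
    "e": ("uploader_provided", "sbv"),
}


def filenames_for(video_id, entry):
    """List the filenames implied by one index entry."""
    names = []
    for key, (status, extension) in KEY_RULES.items():
        seen = Counter()
        for language in entry.get(key, []):
            # A repeated language code marks alternate versions 1, 2, ...
            version = seen[language]
            seen[language] += 1
            suffix = f"_{version}" if version else ""
            names.append(f"{video_id}_{language}_{status}{suffix}.{extension}")
    if "f" in entry:
        names.append(f"{video_id}_published_credits.json")
    return names


# Made-up entry: English draft captions retrieved in two non-identical versions.
sample_index = json.loads('{"AbCdEfGhIjK": {"c": ["en", "fr", "en"], "f": 1}}')
for video_id, entry in sample_index.items():
    print(filenames_for(video_id, entry))
# ['AbCdEfGhIjK_en_community_draft.sbv', 'AbCdEfGhIjK_fr_community_draft.sbv',
#  'AbCdEfGhIjK_en_community_draft_1.sbv', 'AbCdEfGhIjK_published_credits.json']
```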
HOW THE ARCHIVE WAS CREATED
The archive was created using scripts which retrieved published, draft, and uploader-provided captions and metadata from YouTube’s Community Contributions editor. The scripts also retrieved published credits from YouTube’s video watch page.
For each video, the script scanned the webpages for all 196 languages which were supported by YouTube’s Community Contributions feature. Caption files were created by parsing the HTML source code of the Community Contributions caption editor and generating an SBV file which matched the format used by the download captions button in the editor (the download captions button used JavaScript so the SBV file generation logic had to be reimplemented). Metadata files were created by parsing the HTML source code of the Community Contributions metadata editor and generating a JSON-formatted file. Credits files were created by parsing the HTML source code of the video watch page and generating a JSON-formatted file.
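For readers who want to load the caption files, the following is a minimal parsing sketch. It assumes the files follow the common SBV layout, in which each cue is a timing line of the form H:MM:SS.mmm,H:MM:SS.mmm followed by one or more text lines and a blank line. This has not been verified against every file in the archive, and cues whose text contains blank lines would need extra handling. The function names and the sample text are hypothetical.

```python
import re

# Timing line: start and end timestamps separated by a comma.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert timestamp fields (as strings) to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_sbv(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        match = TIMING.fullmatch(lines[0].strip()) if lines else None
        if match is None:
            continue  # skip blocks that do not begin with a timing line
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        cues.append((start, end, "\n".join(lines[1:])))
    return cues


sample = (
    "0:00:01.000,0:00:03.500\n"
    "Hello world\n"
    "\n"
    "0:00:04.000,0:00:06.250\n"
    "Second cue\n"
    "second line\n"
)
print(parse_sbv(sample))
# [(1.0, 3.5, 'Hello world'), (4.0, 6.25, 'Second cue\nsecond line')]
```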
The main project was coordinated by an item tracker which was provided by Archive Team. Upon completion of each video, all videos, channels, playlists, and mix playlists that were included on the YouTube video watch page were submitted to the tracker, allowing discovery and retrieval of data for a wide selection of videos on YouTube. By the end of the project, the tracker reported 51,934,967 completed items (including videos, channels, playlists, and mix playlists), and an additional 344,553,220 items which were discovered but not completed by the time the Community Contributions editor was made inaccessible. (Note: the number of items reported completed by the tracker may be slightly higher than the number of items for which data was saved because there was a bug in the initial version of the project where data wasn’t submitted correctly. This bug affected up to 1.2 million videos which were scanned in the first hours of the project, though some of these videos were later re-scanned manually; see below.) Ultimately, the tracker project returned data for 349,112 videos (videos with no Community Contributions data, as well as channel, playlist, and mix playlist items, did not return any data).
Additionally, several contributors in the community ran the scripts with their own lists of videos and channels. Combined, these contributors provided data for 92,732 videos.
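Together, the tracker project (349,112 videos) and the community contributors (92,732 videos) returned data for 441,844 videos, while the final dataset contains 406,394 unique videos. This means that overlapping data was returned for 35,450 videos. For most of these videos the data was identical, but for a few videos some data files were not identical, and alternate versions were created in the dataset.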
The script run by volunteers that retrieved and completed items from the tracker, and the script that manually retrieved data from specified YouTube videos and channels, are available on GitHub. Please note, however, that these scripts are no longer functional because the YouTube Community Contributions editor has been made inaccessible.
Credits data for published captions was collected between September 23, 2020 and approximately October 14, 2020, when credits data for published captions was made inaccessible. The captions, titles, and descriptions data (community draft, community published, and uploader-provided versions) was collected between September 23, 2020 and October 28, 2020, when the Community Contributions editor was made inaccessible.
-----
Thank you to everyone who contributed to this project!
Feel free to join us on Discord!
APPENDIX: LIST OF SUPPORTED LANGUAGE CODES AND THEIR ENGLISH LANGUAGE NAMES
aa | Afar |
ab | Abkhazian |
af | Afrikaans |
am | Amharic |
ar | Arabic |
arc | Aramaic |
as | Assamese |
ase | American Sign Language |
ay | Aymara |
az | Azerbaijani |
ba | Bashkir |
be | Belarusian |
bg | Bulgarian |
bh | Bihari |
bi | Bislama |
bn | Bangla |
bo | Tibetan |
br | Breton |
bs | Bosnian |
ca | Catalan |
cho | Choctaw |
chr | Cherokee |
co | Corsican |
cs | Czech |
cy | Welsh |
da | Danish |
de | German |
de-AT | German (Austria) |
de-CH | German (Switzerland) |
de-DE | German (Germany) |
dz | Dzongkha |
el | Greek |
en | English |
en-CA | English (Canada) |
en-GB | English (United Kingdom) |
en-IE | English (Ireland) |
en-IN | English (India) |
en-US | English (United States) |
eo | Esperanto |
es | Spanish |
es-419 | Spanish (Latin America) |
es-ES | Spanish (Spain) |
es-MX | Spanish (Mexico) |
es-US | Spanish (United States) |
et | Estonian |
eu | Basque |
fa | Persian |
fa-AF | Persian (Afghanistan) |
fa-IR | Persian (Iran) |
ff | Fulah |
fi | Finnish |
fil | Filipino |
fj | Fijian |
fo | Faroese |
fr | French |
fr-BE | French (Belgium) |
fr-CA | French (Canada) |
fr-CH | French (Switzerland) |
fr-FR | French (France) |
fy | Western Frisian |
ga | Irish |
gd | Scottish Gaelic |
gl | Galician |
gn | Guarani |
gu | Gujarati |
ha | Hausa |
hak | Hakka Chinese |
hak-TW | Hakka Chinese (Taiwan) |
hi | Hindi |
hi-Latn | Hindi (Latin) |
ho | Hiri Motu |
hr | Croatian |
ht | Haitian Creole |
hu | Hungarian |
hy | Armenian |
ia | Interlingua |
id | Indonesian |
ie | Interlingue |
ig | Igbo |
ik | Inupiaq |
is | Icelandic |
it | Italian |
iu | Inuktitut |
iw | Hebrew |
ja | Japanese |
jv | Javanese |
ka | Georgian |
kk | Kazakh |
kl | Kalaallisut |
km | Khmer |
kn | Kannada |
ko | Korean |
ks | Kashmiri |
ku | Kurdish |
ky | Kyrgyz |
la | Latin |
lb | Luxembourgish |
ln | Lingala |
lo | Lao |
lt | Lithuanian |
lus | Mizo |
lv | Latvian |
mas | Masai |
mg | Malagasy |
mi | Maori |
mk | Macedonian |
ml | Malayalam |
mn | Mongolian |
mni | Manipuri |
mo | Moldavian |
mr | Marathi |
ms | Malay |
mt | Maltese |
my | Burmese |
na | Nauru |
nan | Min Nan Chinese |
nan-TW | Min Nan Chinese (Taiwan) |
ne | Nepali |
nl | Dutch |
nl-BE | Dutch (Belgium) |
nl-NL | Dutch (Netherlands) |
no | Norwegian |
nv | Navajo |
oc | Occitan |
om | Oromo |
or | Odia |
pa | Punjabi |
pl | Polish |
ps | Pashto |
pt | Portuguese |
pt-BR | Portuguese (Brazil) |
pt-PT | Portuguese (Portugal) |
qu | Quechua |
rm | Romansh |
rn | Rundi |
ro | Romanian |
ru | Russian |
ru-Latn | Russian (Latin) |
rw | Kinyarwanda |
sa | Sanskrit |
sc | Sardinian |
scn | Sicilian |
sd | Sindhi |
sdp | Sherdukpen |
sg | Sango |
sh | Serbo-Croatian |
si | Sinhala |
sk | Slovak |
sl | Slovenian |
sm | Samoan |
sn | Shona |
so | Somali |
sq | Albanian |
sr | Serbian |
sr-Cyrl | Serbian (Cyrillic) |
sr-Latn | Serbian (Latin) |
ss | Swati |
st | Southern Sotho |
su | Sundanese |
sv | Swedish |
sw | Swahili |
ta | Tamil |
te | Telugu |
tg | Tajik |
th | Thai |
ti | Tigrinya |
tk | Turkmen |
tl | Tagalog |
tlh | Klingon |
tn | Tswana |
to | Tongan |
tpi | Tok Pisin |
tr | Turkish |
ts | Tsonga |
tt | Tatar |
tw | Twi |
uk | Ukrainian |
ur | Urdu |
uz | Uzbek |
vi | Vietnamese |
vo | Volapük |
vor | Voro |
wo | Wolof |
xh | Xhosa |
yi | Yiddish |
yo | Yoruba |
yue | Cantonese |
yue-HK | Cantonese (Hong Kong) |
zh | Chinese |
zh-CN | Chinese (China) |
zh-HK | Chinese (Hong Kong) |
zh-Hans | Chinese (Simplified) |
zh-Hant | Chinese (Traditional) |
zh-SG | Chinese (Singapore) |
zh-TW | Chinese (Taiwan) |
zu | Zulu |
- Access-restricted-item
- true
- Addeddate
- 2021-01-25 23:11:55
- Identifier
- youtube-community-contributions
- Noindex
- true
- Scanner
- Internet Archive Python library 1.9.4
- Year
- 2020
IN COLLECTIONS
Archive Team: YouTube
Uploaded by tech234a